Parallelizing general histogram application for CUDA architectures

Abstract

Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, providing an efficient and scalable way to parallelize it can be challenging. This holds especially for platforms that contain one or several massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, use of global memory, and similar concerns. In this paper we compare two approaches for implementing general-purpose histogramming on GPUs. The first algorithm is based on private copies of bin counters stored in shared memory for each block of threads. The second uses the Thrust library to sort the input elements and then to search for upper bounds according to bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of input elements. We also implement overlapping of data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but it can support only a limited number of bins. On the other hand, the sort-search strategy achieves about 50% higher speedup than privatization when characters are used as input, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of the input parameters.
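To make the two strategies concrete, here is a minimal CPU-side sketch in standard C++. It is not the authors' CUDA or Thrust code. It assumes unsigned integer inputs, equal-width bins starting at zero, and a last bin that absorbs any larger value. The names `histogram_privatized`, `histogram_sort_search`, `num_blocks`, and `bin_width` are hypothetical. The "blocks" in the first function are emulated by an ordinary loop, whereas on a GPU each block would run concurrently with its counters in shared memory. The second function uses `std::sort` and `std::upper_bound` as stand-ins for the Thrust sort and search steps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Histogram = std::vector<std::size_t>;

// Strategy 1 (privatization), emulated sequentially on the CPU.
// Each hypothetical "block" counts its own slice of the input into a
// private copy of the bin counters; the private copies are merged at the end.
Histogram histogram_privatized(const std::vector<unsigned>& data,
                               unsigned bin_width,
                               std::size_t num_bins,
                               std::size_t num_blocks)
{
    assert(bin_width > 0 && num_bins > 0 && num_blocks > 0);
    std::vector<Histogram> private_bins(num_blocks, Histogram(num_bins, 0));
    const std::size_t chunk = (data.size() + num_blocks - 1) / num_blocks;

    for (std::size_t blk = 0; blk < num_blocks; ++blk) {
        const std::size_t begin = std::min(data.size(), blk * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        for (std::size_t i = begin; i < end; ++i) {
            // Values past the last bin edge are clamped into the last bin.
            const std::size_t b =
                std::min<std::size_t>(data[i] / bin_width, num_bins - 1);
            ++private_bins[blk][b];
        }
    }

    // Merge step: sum the private copies into the final histogram.
    Histogram total(num_bins, 0);
    for (const Histogram& p : private_bins)
        for (std::size_t b = 0; b < num_bins; ++b)
            total[b] += p[b];
    return total;
}

// Strategy 2 (sort-search): sort the input, find for every bin how many
// elements lie at or below its largest value, then take differences of
// consecutive cumulative counts.
Histogram histogram_sort_search(std::vector<unsigned> data,  // copied, then sorted
                                unsigned bin_width,
                                std::size_t num_bins)
{
    assert(bin_width > 0 && num_bins > 0);
    std::sort(data.begin(), data.end());

    Histogram hist(num_bins, 0);
    std::size_t below = 0;  // number of elements already assigned to earlier bins
    for (std::size_t b = 0; b < num_bins; ++b) {
        std::size_t upto = data.size();  // the last bin takes everything that is left
        if (b + 1 < num_bins) {
            const unsigned long long last_value =
                static_cast<unsigned long long>(b + 1) * bin_width - 1;
            upto = static_cast<std::size_t>(
                std::upper_bound(data.begin(), data.end(), last_value) - data.begin());
        }
        hist[b] = upto - below;
        below = upto;
    }
    return hist;
}
```

In this sketch the privatized version needs one counter array per block, which is one way to see why a shared-memory implementation can hold only a limited number of bins, while the sort-search version needs no per-bin storage beyond the output itself.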

Bibliographic details

Authors
Milic Ugljesa; Gelado Fernandez Isaac; Puzovic Nikola; Ramírez Bellido Alejandro; Tomasevic Milo
Year 2024
Original format PDF
Language English

Similar literature

1. Norman Matloff. Parallel Computing for Data Science: With Examples in R, C++, and CUDA. Boca Raton: CRC Press. [J]. Eddelbuettel Dirk, Biometrics: Journal of the Biometric Society: An International Society Devoted to the Mathematical and Statistical Aspects of Biology. 2018, No. 2
2. GPU Compute Unified Device Architecture (CUDA)-based Parallelization of the RRTMG Shortwave Rapid Radiative Transfer Model [J]. Mielikainen Jarno, Price Erik, Huang Bormin, Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Journal of. 2016, No. 2
3. Parallelizing general histogram application for CUDA architectures [C]. Milic Ugljesa, Gelado Isaac, Puzovic Nikola, International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation. 2013
4. A journey through performance evaluation, tuning, and analysis of parallelized applications and parallel architectures: Quantitative approach. [D]. Mustafa, Dheya G. 2013
5. Parallelized Seeded Region Growing Using CUDA [O]. Seongjin Park, Jeongjin Lee, Hyunna Lee, 2014
6. Parallelizing General Histogram Application for CUDA Architectures [O]. 2016
